Using R

Michael Franke

topics for today

 

  • basics of R
  • tidyverse
  • tidy data
  • data wrangling
  • plotting
  • Rmarkdown

R for data science

R4DS

R4DS cover

freely available online: R for Data Science

data science?

data scientist

read more

what R is (not)

  • special purpose programming language for statistical computing

    • statistics, data mining, data visualization
  • authorities advise not to think of R as a general-purpose programming language!

  • think of it as a tool optimized for creating scripts to manipulate, plot and analyze data

diagram from 'R for Data Science'

past & present

  • a trusted old friend from 1993
  • still thriving
    • see TIOBE ranking (based on search query results)

TIOBE index

extensibility & community support

a lot of innovation and development takes place in packages

go browse some 12,000 packages on CRAN

 

install packages (only once)

install.packages('tidyverse')

load packages (for every session)

library(tidyverse)

base R & package functions

base R functionality is always available

x = seq(from = 1, to = 10, length.out = 1000)
plot(x,x^2)

packages bring extra functions

library(ggplot2)
ggplot2::qplot(x,x^2)

tidyverse

 

overview of tidyverse

tidyverse website

RStudio

integrated development environment for R

RStudio screenshot

cheat sheet

basics of R

overview

  • basic properties of R
  • data types
    • numbers, vectors & matrices
    • characters & factors
    • lists, data.frames & tibbles
  • probability distributions
  • functional programming elements
  • functions

for all base R stuff, check the R manual

general remarks about R

  • free (GNU General Public License)
  • interpreted language
6 * 7
## [1] 42
  • vector/matrix based
x = c(1,2,3)
x + 1
## [1] 2 3 4
  • supports object-oriented, procedural & functional styles

  • convenient interfaces to other languages

  • assignment in both directions possible

x <- 3
3 -> y
x == y
## [1] TRUE

help

 

help('qplot')
qplot {ggplot2} R Documentation
Quick plot

Description

qplot is a shortcut designed to be familiar if you're used to base plot(). It's a convenient
wrapper for creating a number of different types of plots using a consistent calling scheme.  
It's great for allowing you to produce plots quickly, but I highly recommend learning ggplot()
as it makes it easier to create complex graphics.

Usage

qplot(x, y = NULL, ..., data, facets = NULL, margins = FALSE,
  geom = "auto", xlim = c(NA, NA), ylim = c(NA, NA), log = "",
  main = NULL, xlab = deparse(substitute(x)),
  ylab = deparse(substitute(y)), asp = NA, stat = NULL, position = NULL)

numbers, vectors & matrices

  • standard number precision is double
typeof(2)
## [1] "double"
  • vectors are declared using c()
x = c(10,20,30)
x
## [1] 10 20 30
  • everything is a vector (possibly length 1)
c(length(200), length("huhu"))
## [1] 1 1
  • indexing starts at 1
x[2]
## [1] 20
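
The points about vectors and 1-based indexing can be illustrated with a small sketch. It uses only base R; the vector repeats the values from the slide, and the particular index choices are just illustrative.

```r
# Example vector (same values as on the slide)
x <- c(10, 20, 30)

# Arithmetic is applied element-wise to the whole vector
x / 10          # 1 2 3

# Indexing is 1-based; a vector of positions selects several elements
x[c(1, 3)]      # 10 30

# A negative index drops that position
x[-1]           # 20 30

# A logical vector keeps the elements where the condition holds
x[x > 15]       # 20 30
```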

numbers, vectors & matrices (2)

  • column-major mode
m = matrix(c(1,2,3,4,5,6), nrow = 2)
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
m[1,]
## [1] 1 3 5
  • vectors are treated as column vectors in matrix products
m %*% x ## matrix product (x = c(10, 20, 30) from above)
##      [,1]
## [1,]  220
## [2,]  280
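
To make "column-major" concrete, here is a minimal base R sketch contrasting the default fill order with `byrow = TRUE`. The variable names `m_col` and `m_row` are made up for this illustration.

```r
v <- 1:6

# Default: values fill the matrix column by column (column-major)
m_col <- matrix(v, nrow = 2)

# byrow = TRUE fills row by row instead
m_row <- matrix(v, nrow = 2, byrow = TRUE)

m_col[2, 1]   # 2
m_row[2, 1]   # 4

dim(m_col)    # 2 3
t(m_col)      # transpose, a 3 x 2 matrix

# %*% is the matrix product; a plain vector acts as a column vector here
x <- c(10, 20, 30)
m_col %*% x   # 2 x 1 matrix with entries 220 and 280
```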

character vectors and factors

  • strings are called characters
typeof("huhu")
## [1] "character"
  • vector of characters
chr.vector = c("huhu", "hello", "huhu", "ciao")
chr.vector
## [1] "huhu"  "hello" "huhu"  "ciao"
  • factors track levels
factor(chr.vector)
## [1] huhu  hello huhu  ciao 
## Levels: ciao hello huhu
  • ordered factors arrange their levels
factor(chr.vector, ordered = T, 
       levels = c("huhu", "ciao", "hello"))
## [1] huhu  hello huhu  ciao 
## Levels: huhu < ciao < hello
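
A short base R sketch of what "tracking levels" means in practice: a factor stores integer codes that point into its level set, and an ordered factor additionally supports comparisons. The names `f` and `f.ord` are chosen here for illustration only.

```r
chr.vector <- c("huhu", "hello", "huhu", "ciao")
f <- factor(chr.vector)

levels(f)        # "ciao" "hello" "huhu" (sorted alphabetically by default)
as.integer(f)    # 3 2 3 1 (position of each element's level)
table(f)         # number of occurrences per level

# With an ordered factor, comparisons follow the declared level order
f.ord <- factor(chr.vector, ordered = TRUE,
                levels = c("huhu", "ciao", "hello"))
f.ord < "hello"  # TRUE FALSE TRUE TRUE
```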

lists & data frames

  • lists are key-value pairs
my.list = list(dudu = 1,
               chacha = c("huhu", "ciao"))
  • data frames as lists of same-length vectors
exp.data = data.frame(trial = 1:5,
              condition = factor(c("C1", "C2", "C1", 
                                   "C3", "C2"),
                                 ordered = T),
              response = c(121, 133, 119, 102, 156))
exp.data
##   trial condition response
## 1     1        C1      121
## 2     2        C2      133
## 3     3        C1      119
## 4     4        C3      102
## 5     5        C2      156
  • access columns
exp.data$condition
## [1] C1 C2 C1 C3 C2
## Levels: C1 < C2 < C3
  • access rows
exp.data[3,]
##   trial condition response
## 3     3        C1      119
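
A few more ways of getting at list elements and data frame cells, as a base R sketch. The data repeat the slide's toy example, except that `condition` is kept as a plain character vector here for simplicity.

```r
my.list <- list(dudu = 1, chacha = c("huhu", "ciao"))

my.list$chacha        # access by name: "huhu" "ciao"
my.list[["dudu"]]     # same idea with double brackets: 1

exp.data <- data.frame(trial = 1:5,
                       condition = c("C1", "C2", "C1", "C3", "C2"),
                       response = c(121, 133, 119, 102, 156))

# Rows are selected before the comma, columns after it
c2.rows <- exp.data[exp.data$condition == "C2", ]   # the two C2 trials
exp.data[3, "response"]                             # single cell: 119

# A column is an ordinary vector, so vector functions apply
mean(exp.data$response)                             # 126.2
```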

tibbles

  • tibbles are data frames in the tidyverse
as_tibble(exp.data)
## # A tibble: 5 x 3
##   trial condition response
##   <int> <ord>        <dbl>
## 1     1 C1             121
## 2     2 C2             133
## 3     3 C1             119
## 4     4 C3             102
## 5     5 C2             156
  • compare to previous data frame
exp.data
##   trial condition response
## 1     1        C1      121
## 2     2        C2      133
## 3     3        C1      119
## 4     4        C3      102
## 5     5        C2      156

   

   

  • some differences
my.tibble    = tibble(x = 1:10, y = x^2)      ## columns can refer to earlier columns
my.dataframe = data.frame(x = 1:10, y = x^2)  ## ERROR (or silently uses a global x) :/

probability distributions in R

  • R has many built-in probability distributions
    • normal distribution
    • beta distribution
    • …
  • additional distributions supplied by packages
    • multi-variate normal
    • Dirichlet
    • …
  • each distribution mydist is associated with four functions:
    1. dmydist(x, ...) gives the probability (mass/density) \(f(x)\) for x
    2. pmydist(x, ...) gives the cumulative distribution function \(F(x)\) for x
    3. qmydist(p, ...) gives the value \(x\) for which p = pmydist(x, ...)
    4. rmydist(n, ...) returns n samples from the distribution

example

x = seq(-5, 5, length.out = 1000)
y = dnorm(x, mean = 1, sd = 0.5)
plot(x,y)
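
The four functions per distribution can be tried out on the normal distribution, as in the following base R sketch. The seed, the sample size and the variable name `samples` are arbitrary choices made for this illustration.

```r
# d: density of the standard normal at x = 0
dnorm(0)              # 0.3989423

# p: cumulative probability P(X <= 0)
pnorm(0)              # 0.5

# q: quantile function, the inverse of p
qnorm(0.975)          # 1.959964
qnorm(pnorm(1.3))     # 1.3

# r: random samples (seed fixed only to make the sketch reproducible)
set.seed(1)
samples <- rnorm(10000, mean = 1, sd = 0.5)
mean(samples)         # close to the true mean of 1
```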

maps & pipes (tidyverse)

  • mapping
data = tibble(IQ = c(100,110,120,125), 
              RT = c(67,58,98,80) )
map_dbl(data, mean)
##     IQ     RT 
## 113.75  75.75
  • piping
tibble(IQ = c(100,110,120,125), 
              RT = c(67,58,98,80) ) %>% 
  map_dbl(mean)
##     IQ     RT 
## 113.75  75.75
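
A sketch of what the pipe does: `x %>% f(y)` is another way of writing `f(x, y)`, so a nested call can be read left to right. This assumes the tibble, purrr and magrittr packages are installed (they come with the tidyverse); the names `nested` and `piped` are made up for the comparison.

```r
library(tibble)
library(purrr)
library(magrittr)   # provides %>%

data <- tibble(IQ = c(100, 110, 120, 125),
               RT = c(67, 58, 98, 80))

# Nested call, read from the inside out ...
nested <- round(sqrt(sum(data$IQ)), 2)

# ... and the same computation as a pipeline, read left to right
piped <- data$IQ %>% sum() %>% sqrt() %>% round(2)

identical(nested, piped)   # TRUE

# map() returns a list; map_dbl() returns a named numeric vector
map(data, range)
map_dbl(data, median)      # IQ 115, RT 73.5
```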

functions

  • named custom functions
crazy.operation = function(x,y) {
  x+y
}
crazy.operation(2,3)
## [1] 5
  • anonymous functions
tibble(IQ = c(100,110,120,125), 
              RT = c(67,58,98,80) ) %>% 
  map_dbl(function(i) {max(i)-min(i)})
## IQ RT 
## 25 40
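
Two further details about functions, as a base R sketch: arguments can have defaults, and the value of the last expression is returned. The function `standardize` is a hypothetical example, and `sapply()` is used as a base R counterpart to `map_dbl()`.

```r
# Named function with a default argument that depends on another argument
standardize <- function(x, center = mean(x)) {
  (x - center) / sd(x)   # the last expression is the return value
}

z <- standardize(c(100, 110, 120, 125))
round(mean(z), 10)       # 0, since the values were centered on their mean

# Anonymous function applied to each column of a data frame
data <- data.frame(IQ = c(100, 110, 120, 125),
                   RT = c(67, 58, 98, 80))
spread.per.column <- sapply(data, function(i) max(i) - min(i))
spread.per.column        # IQ 25, RT 40
```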

tidy (rectangular) data

types of data

 

data from experimental (psych) studies is usually rectangular data

 

examples of (usually) non-rectangular data:

  • image data
  • sound data
  • video data
  • corpora
  • …

 

the tidyverse is particularly efficient for dealing with tidy rectangular data

rectangular data

library(nycflights13)
nycflights13::flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

study Chapters 5 and 12 from R for Data Science

tidy data

  1. each variable is a column
  2. each observation is a row
  3. each value is a cell

 

tidy data

untidy data 1

this is untidy if we want to analyze/plot grade as a function of exam type

grades = tibble(name = c('Michael', 'Noa', 'MadEye'),
                midterm = c(3.7, 1.0, 1.3),
                final = c(4.0, 1.3, 1.0))
grades
## # A tibble: 3 x 3
##   name    midterm final
##   <chr>     <dbl> <dbl>
## 1 Michael     3.7   4  
## 2 Noa         1     1.3
## 3 MadEye      1.3   1

to tidy up, we need to gather columns which are not separate variables into a new column

grades %>% gather('midterm', 'final', 
                  key = 'exam', value = 'grade')
## # A tibble: 6 x 3
##   name    exam    grade
##   <chr>   <chr>   <dbl>
## 1 Michael midterm   3.7
## 2 Noa     midterm   1  
## 3 MadEye  midterm   1.3
## 4 Michael final     4  
## 5 Noa     final     1.3
## 6 MadEye  final     1
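
In newer versions of tidyr (1.0.0 and later, which is an assumption about your installation), `pivot_longer()` is the recommended replacement for `gather()`. A sketch of the same reshaping; the result name `grades_long` is made up, and the rows come out grouped by name rather than by exam.

```r
library(tibble)
library(tidyr)

grades <- tibble(name = c("Michael", "Noa", "MadEye"),
                 midterm = c(3.7, 1.0, 1.3),
                 final = c(4.0, 1.3, 1.0))

# Collect the two exam columns into a key column and a value column
grades_long <- pivot_longer(grades,
                            cols = c(midterm, final),
                            names_to = "exam",
                            values_to = "grade")
grades_long   # 6 rows, columns: name, exam, grade
```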

untidy data 2

this is untidy if we want to analyze grade as a function of participation

results = tibble(name = c('Michael', 'Noa', 'MadEye', 
                          'Michael', 'Noa', 'MadEye'),
                 what = rep(c('grade', 'participation'), 
                            each = 3),
                 howmuch = c(3.7, 1.0, 1.0, 55, 100, 100))
results
## # A tibble: 6 x 3
##   name    what          howmuch
##   <chr>   <chr>           <dbl>
## 1 Michael grade             3.7
## 2 Noa     grade             1  
## 3 MadEye  grade             1  
## 4 Michael participation    55  
## 5 Noa     participation   100  
## 6 MadEye  participation   100

to tidy up, we need to spread cells from a row out over several columns

results %>% spread(key = 'what', value = 'howmuch')
## # A tibble: 3 x 3
##   name    grade participation
##   <chr>   <dbl>         <dbl>
## 1 MadEye    1             100
## 2 Michael   3.7            55
## 3 Noa       1             100
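
The counterpart to `spread()` in newer tidyr versions (1.0.0 and later, again an assumption about your installation) is `pivot_wider()`. A sketch with the same toy data; `results_wide` is a made-up name. Once each variable has its own column, relating the two is a one-liner.

```r
library(tibble)
library(tidyr)

results <- tibble(name = c("Michael", "Noa", "MadEye",
                           "Michael", "Noa", "MadEye"),
                  what = rep(c("grade", "participation"), each = 3),
                  howmuch = c(3.7, 1.0, 1.0, 55, 100, 100))

# One column per value of 'what', filled with the values of 'howmuch'
results_wide <- pivot_wider(results,
                            names_from = what,
                            values_from = howmuch)
results_wide   # 3 rows, columns: name, grade, participation

# Grade as a function of participation is now easy to look at
cor(results_wide$grade, results_wide$participation)   # -1 for this toy data
```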

ggplot

Layered grammar of graphics

  • structured description language for plots (relevant for data science)
  • smart system of defaults
  • multiple layers:
    • data + transformation + geom. object + aesthetics
  • basic components:
    • data
    • coordinate system
    • statistical transformation
      • means, standard errors, bins, …
    • scales
      • continuous, discrete, …
    • geometric object
      • how to visualize the data (points, bars, lines, …)
    • aesthetic mapping
      • point shape, size, color, …
    • facets

for background see Wickham (2010)

example

fully explicit

ggplot() +
  layer(
    data = diamonds,
    mapping = aes(x = carat, y = price),
    geom = "point",
    stat = "identity",
    position = "identity"
  ) +
  scale_x_continuous() +
  scale_y_continuous() +
  coord_cartesian()

with syntactic sugar and defaults

diamonds %>% ggplot(aes(carat, price)) + geom_point()
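
Building on the short form, further components from the grammar (facets, scales, extra aesthetics) are added with `+`. The following is a sketch, not the only way to do it; it assumes ggplot2 is installed, and the choice of facets and scale is arbitrary.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = carat, y = price)) +  # data + aesthetic mapping
  geom_point(alpha = 0.2) +                         # geometric object
  facet_wrap(~ cut) +                               # one panel per level of 'cut'
  scale_y_log10()                                   # non-default scale for y

# A plot is an ordinary R object; printing it draws it
p
```

Because the plot is stored in `p`, more layers can be added later, for example `p + theme_minimal()`.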

general structure of a ggplot call

screenshot_cheat_sheet

from the cheat sheet

Rmarkdown

why Rmarkdown

 

  • prepare, analyze & plot data right inside your document

  • hand over all of your work in one single, easily executable chunk
    • support reproducible and open research
  • export to a variety of different formats

Rmarkdown formats

flow of information

 

Rmarkdown info flow

 

Rmarkdown formats

markdown

headers & sections

# header 1
## header 2
### header 3

emphasis, highlighting etc.

*italics* or _italics_
**bold** or __bold__
~~strikeout~~

links

[link](https://www.google.com)

inline code & code blocks

`function(x) return(x - 1)`

cheat sheet

Rmarkdown

extension of markdown to dynamically integrate R output

multiple output formats:

  • HTML pages, HTML slides (here), …
  • PDF, LaTeX, Word, …

cheat sheet and a quick tour

supports LaTeX

inline equations with $\theta$

equation blocks with

$$ \begin{align*} E &= mc^2 \\
& = \text{a really smart formula}
\end{align*} $$

 

caveat

LaTeX-style formulas will be rendered differently depending on the output method:

  • PDF-LaTeX gives you genuine LaTeX with (almost) all abilities
  • HTML output uses MathJax to emulate LaTeX-like behavior
    • only LaTeX-packages & functionality emulated in JS will be available